DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
Garbage in, garbage out
— Virtually all users of data
A very “recent” example is the 2018 Census carried out by Stats NZ. Stats NZ has since published material describing how it addressed these data-quality concerns.
To ensure that the data you collect…
On average, how “far away” is our estimate from the “truth”?
There can be multiple reasons for any biases in our estimates of the “truth”, for example:
On average, how variable is our estimate of the “truth”?
Like bias, there can be multiple reasons for imprecise estimates of the “truth”, for example:
A population includes all individuals or objects of interest. In this context, data are typically collected from a sample, which is a subset of the population.
It can be difficult to measure and categorise variables for the target population
Nevertheless, a census of the target population means that anything calculated from the collected variables is the ground “truth”
A well-designed sample of the target population means that we get an accurate estimate of the ground “truth”
Source: www.andrewchen.nz/polls
The ground “truth” is known as a (population) parameter. For example:
The estimate of a parameter, based on our sample data, is known as a statistic
All potential observations have the same chance to be selected for the sample
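As a minimal sketch (not from the notes), a simple random sample can be drawn in R with `sample()`; the population of 100 numbered units here is hypothetical:

```r
# A sketch: drawing a simple random sample of 10 units from a
# hypothetical population of 100 numbered units
set.seed(2023)                         # for a reproducible example
population <- 1:100                    # hypothetical unit IDs
srs <- sample(population, size = 10)   # each unit has the same chance
srs
```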
Why should we take random samples from the target population?
The survey literature makes a distinction between sampling errors, which arise from the decision to take a sample rather than trying to survey the whole population (which is what a census tries to do), and nonsampling errors, which can occur whether we take a sample or a census
— Wild & Seber (2000)
Sometimes—by chance—we can select a “bad” simple random sample that may not be representative of the population
[1] 6 9 1 7 3 5 4 10 2 8
\[ (1/100)^{10} = 1 \times 10^{-20} \]
In light of Slide 12, we can demonstrate that—on average—a statistic calculated from a simple random sample is an unbiased estimate of the (unknown) parameter
Why? Simple random samples provide a neat “predictable” property, even though randomness is involved
More on this once we start the Introduction to Statistical Inference topic
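This unbiasedness can be sketched with a small simulation (the skewed “income” population below is hypothetical, not the course data): the average of many sample means lands very close to the population mean.

```r
# Simulated sketch: the mean of a simple random sample is an
# unbiased estimate of the population mean
set.seed(2023)
population <- rexp(10000, rate = 1 / 600)   # hypothetical skewed "incomes"
sample.means <- replicate(5000, mean(sample(population, size = 50)))
mean(sample.means)   # on average, close to ...
mean(population)     # ... the population parameter
```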
All potential observations are first split into distinct groups (strata). Then we take a simple random sample of all potential observations within each group (stratum)
Why are stratified samples useful?
Recall the following piece of context:
The survey was an annual snapshot to produce income statistics on New Zealanders aged 15 and over based on a representative sample of the population.
One strategy to get a representative sample of the population is to conduct a stratified sample. Why?
Here are the population proportions calculated for 2006:

| Region | Proportion |
|---|---|
| Auckland | 0.33 |
| BoP | 0.06 |
| Christchurch | 0.13 |
| Gisborne | 0.05 |
| Manawatu | 0.05 |
| Nelson | 0.04 |
| Northland | 0.04 |
| Otago | 0.05 |
| Southland | 0.02 |
| Taranaki | 0.03 |
| Waikato | 0.09 |
| Wellington | 0.11 |
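As a sketch of proportional allocation (the overall sample size of 1000 is a hypothetical choice), the 2006 region proportions above determine how many units to sample from each stratum:

```r
# A sketch: proportional allocation for a stratified sample of
# n = 1000, using the 2006 region proportions
props <- c(Auckland = 0.33, BoP = 0.06, Christchurch = 0.13,
           Gisborne = 0.05, Manawatu = 0.05, Nelson = 0.04,
           Northland = 0.04, Otago = 0.05, Southland = 0.02,
           Taranaki = 0.03, Waikato = 0.09, Wellington = 0.11)
n <- 1000              # hypothetical total sample size
round(n * props)       # units to sample from each stratum
```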
The variable is plotted side-by-side for each level of the categorical variable. The goal of side-by-side plots is to compare and contrast the plotted variable between levels
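For instance, a side-by-side box plot can be sketched with R's built-in `iris` data (not the course data):

```r
# A sketch: side-by-side box plots of a numeric variable
# (sepal length) by a categorical variable (species)
library(lattice)   # lattice ships with R
p <- bwplot(Species ~ Sepal.Length, data = iris,
            xlab = "Sepal length (cm)")
print(p)           # explicitly print the trellis object
```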
With the lattice R package, we can make side-by-side plots of:
- `stripplot()`
- `bwplot()`
- `barchart()` (and with colour)

The variable is plotted for each level of the categorical variable in its own panel. The goal of panel plots is also to compare and contrast the plotted variable between levels
With the lattice R package, we can make panel plots of:
- `stripplot()`
- `bwplot()`
- `histogram()`
- `xyplot()`
- `barchart()`

The variable is plotted and the values are colour-coded by the levels of the categorical variable. The goal of colour is to help distinguish between levels
With the lattice R package, we can make use of colour in:
- `stripplot()`
- `xyplot()`
- `barchart()` (and with side-by-side bars)

Producing descriptive statistic(s) of a numeric variable for each level of a categorical variable requires additional R code:

# The standard set of summary statistics of income by region
split(nzis.df, ~ region) |>
  lapply(\(x) summary(x$income))

$Auckland
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5100.0 243.0 542.0 720.2 989.0 25443.0
$BoP
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1688.0 197.0 470.0 619.7 901.0 11603.0
$Christchurch
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5100.0 246.0 545.5 701.0 970.0 16174.0
$Gisborne
Min. 1st Qu. Median Mean 3rd Qu. Max.
-518.0 243.2 521.5 692.5 924.8 21104.0
$Manawatu
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2379.0 234.0 559.0 655.7 948.2 10081.0
$Nelson
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4420.0 232.0 518.0 679.5 941.5 17538.0
$Northland
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3551.0 247.5 540.0 666.7 938.0 6967.0
$Otago
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3551.0 256.0 570.0 685.6 970.0 8439.0
$Southland
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1308.0 200.0 491.0 647.8 994.0 4631.0
$Taranaki
Min. 1st Qu. Median Mean 3rd Qu. Max.
-413.0 223.5 469.0 634.1 884.5 5866.0
$Waikato
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2110.0 223.0 518.5 647.2 928.8 14782.0
$Wellington
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3551.0 254.5 574.0 729.5 1001.0 18369.0
# The sample standard deviation of income by region
split(nzis.df, ~ region) |>
  lapply(\(x) sd(x$income))

$Auckland
[1] 885.168
$BoP
[1] 682.2891
$Christchurch
[1] 810.055
$Gisborne
[1] 959.4897
$Manawatu
[1] 646.0911
$Nelson
[1] 850.093
$Northland
[1] 711.9051
$Otago
[1] 670.9319
$Southland
[1] 630.7546
$Taranaki
[1] 658.1127
$Waikato
[1] 681.1972
$Wellington
[1] 840.5936
We have so far introduced concepts relating to taking samples from the population. Recall the following point raised in Slide 4
Our sample data is not representative of the population
In practice, taking random samples from the correct population to answer a research question is the hardest part!
What we discussed on the previous slide is known as Selection Bias, which occurs when the target population and the sampling frame do not fully overlap
Some other common issues for a variety of fields are:
Cluster sampling
All potential observations have a chance to be selected for the sample. However, the researcher randomly selects whole groups (clusters) of units rather than individual units
Systematic sampling
All potential observations have a chance to be selected for the sample by choosing every k-th unit from an ordered list, after a random starting point
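A sketch of systematic sampling (the population size and sample size below are hypothetical):

```r
# A sketch: a systematic sample of every k-th unit from a
# hypothetical ordered list, after a random starting point
set.seed(2023)
N <- 100                          # hypothetical population size
n <- 10                           # desired sample size
k <- N / n                        # sampling interval (every 10th unit)
start <- sample(1:k, size = 1)    # random start between 1 and k
sys.sample <- seq(from = start, to = N, by = k)
sys.sample
```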
Self-selected sampling
All potential observations choose whether they are selected for the sample
Choice sampling (Judgement sampling)
The researcher(s) choose which potential observations are selected for the sample
A randomised experiment is a study in which the researcher actively controls one or more explanatory variables. Additionally, the values of the explanatory variable(s) are randomly assigned to the units before the response variable is measured.
Ronald Fisher identified three important principles that should be considered when designing an experiment
Replication to judge if the observed differences in the experimental data are due to a “signal” rather than “noise”. Practical constraints often limit the number of replicates
Randomisation of the replication order and of the explanatory variable values we hypothesise cause a “signal”, so that each observation can be treated as independent of the others. Furthermore, randomisation ensures that each explanatory variable value has the same chance of being assigned to “good” or “bad” observations
Incorporate blocks, where possible, to ensure that the observed differences in the experimental data are calculated from groups of similar observations.
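The randomisation principle above can be sketched in R (the two treatment labels and 20 units are hypothetical):

```r
# A sketch: randomly assign two hypothetical treatment values,
# and hence the run order, to 20 experimental units
set.seed(2023)
treatments <- rep(c("Control", "Treatment"), each = 10)
run.order <- sample(treatments)   # random permutation of assignments
run.order
```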
Response variable Outcome, Dependent
A variable that we believe changes in value because of the explanatory variable
Explanatory variable Treatment, Independent
A variable that we use to understand how the response variable changes in value. In randomised experiments, we often control the values of the explanatory variable(s)
Association
Two variables are associated if values of one variable tend to be related to the values of the other variable
Causation
Two variables are causally associated if changing the value of one variable influences the value of the other variable
What is the difference between association & causation?
Causation means that changes in the explanatory variable produce predictable changes in the response variable, but not the other way around
A confounding variable (confounder) is a third variable that is associated with both the explanatory variable and response variable. A confounding variable can offer a plausible explanation for an association between two variables of interest.
— Lock et al. (2021)
The goal of randomised experiments is to “break” any potential confounding variable(s) with random assignment of the explanatory variable
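As a simulated sketch of this idea (not from the notes): random assignment tends to balance a hypothetical confounder, such as age, across treatment groups.

```r
# Simulated sketch: random assignment balances a hypothetical
# confounder (age) across two treatment groups on average
set.seed(2023)
age <- rnorm(1000, mean = 40, sd = 10)          # hypothetical confounder
group <- sample(rep(c("A", "B"), each = 500))   # random assignment
tapply(age, group, mean)                        # group means are similar
```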
Simon Newcomb experimented with a new method of measuring the speed of light in 1882, which involved using two different mirrors placed approximately 3721.865 metres apart. The following data comes from 20 repeated measurements of the passage time for light to travel from one mirror to another and back again.
The theoretical passage time for the above distance was 24.8296 millionths of a second. If this new method is unbiased and precise, the experimental data should agree with the theoretical passage time.
| Variables | |
|---|---|
| pass.time | A number denoting the passage time for light to travel from one mirror to another and back again (millionths of a second, μs) |
stripplot( ~ pass.time, data = lightspeed.df, jitter.data = TRUE,
           factor = 5, main = "Measurements of passage time for light from Newcomb's experiment",
           xlab = "Passage time (millionths of a second)")

Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.
A professor carried out an experiment to determine the best calcium level to ensure that fish have low respiration rates. The fish were randomly assigned to three tanks with different levels of calcium.
| Variables | |
|---|---|
| Calcium | A factor denoting the calcium level of the tank, Low, Medium or High |
| GillRate | A number denoting the respiration rate of the fish (gill beats per minute, gbpm) |
$High
Calcium GillRate
Length:30 Min. :37.00
Class :character 1st Qu.:45.75
Mode :character Median :58.50
Mean :58.17
3rd Qu.:68.00
Max. :85.00
$Low
Calcium GillRate
Length:30 Min. :44.00
Class :character 1st Qu.:55.50
Mode :character Median :65.00
Mean :68.50
3rd Qu.:84.75
Max. :98.00
$Medium
Calcium GillRate
Length:30 Min. :33.00
Class :character 1st Qu.:46.00
Mode :character Median :59.50
Mean :58.67
3rd Qu.:68.75
Max. :83.00
Recall that we want to judge if the observed differences in the experimental data are due to a “signal” rather than “noise”
In a block design, experimental units (observations) are first divided into homogeneous groups called blocks, and each treatment is randomly assigned to one or more units within each block
— Utts & Heckard (2015)
Strawberries
Consider designing an experiment to find out if the application of herbicides would harm the growth of strawberry plants. You have four kinds of herbicide, A–D, and one control treatment
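One possible block design for this scenario (a sketch, not a prescribed answer) randomises the five treatments within each of four hypothetical blocks, such as garden plots:

```r
# A sketch: within each of four hypothetical blocks (garden plots),
# randomise the order of the five treatments (herbicides A-D, control)
set.seed(2023)
treatments <- c("A", "B", "C", "D", "Control")
design <- sapply(paste("Block", 1:4), function(b) sample(treatments))
design   # one column per block; each treatment appears once per block
```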
Driving Impairment
Consider designing an experiment to determine if the following treatments can causally explain driving ability: alcohol, marijuana, or sober
An observational study is a study in which the researcher does not actively control the explanatory variable(s). The researcher simply observes the values of the explanatory variable(s) as they naturally exist
Retrospective Studies
The researchers sample observations to make inferences based on variables, some of which were measured or categorised previously
Cross-sectional Studies
The researchers sample observations to make inferences based on variables made at a specific point in time
Prospective Studies (Longitudinal Studies)
The researchers sample observations to make inferences based on how variables change over time
This dataset describes an observational study of youths who lived in East Boston, Massachusetts, USA, sometime during the 1970s. The researchers followed these youths for seven years, and their primary research question was whether smokers suffered from reduced lung capacity.
| Variables | |
|---|---|
| Age | An integer denoting the age of a subject (in years) |
| Height_In | A number denoting the height of a subject (in inches) |
| Sex | A factor denoting the sex of the subject, male or female |
| Smoke | A factor denoting the smoking status of the subject, non-smoker or smoker |
| LungCap | A number denoting the lung capacity of the subject (unitless) |
Age Height_In Sex Smoke
Min. : 3.000 Min. :46.00 Length:654 Length:654
1st Qu.: 8.000 1st Qu.:57.00 Class :character Class :character
Median :10.000 Median :61.50 Mode :character Mode :character
Mean : 9.931 Mean :61.14
3rd Qu.:12.000 3rd Qu.:65.50
Max. :19.000 Max. :74.00
LungCap
Min. :0.791
1st Qu.:1.981
Median :2.547
Mean :2.637
3rd Qu.:3.119
Max. :5.793
Remember the goal of side-by-side plots is to compare and contrast
Random assignment of the explanatory variable values to observations versus (Random) sampling of observations whose explanatory variable values are simply observed.
Clinical trials can be designed as a randomised experiment to quantify the effectiveness of a treatment, e.g. vaccination and no vaccination (placebo)
Case-control studies can be designed as a stratified sample to compare two groups, e.g. vaccinated and unvaccinated